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Abstract. We report on the analysis of selected single source data sets from the 
first round of the Mock LISA Data Challenges (MLDC) for white dwarf binaries. We 
implemented an end-to-end pipeline consisting of a grid-based coherent pre-processing 
unit for signal detection, and an automatic Markov Chain Monte Carlo post-processing 
unit for signal evaluation. We demonstrate that signal detection with our coherent 
approach is secure and accurate, and is increased in accuracy and supplemented 
with additional information on the signal parameters by our Markov Chain Monte 
Carlo approach. We also demonstrate that the Markov Chain Monte Carlo routine is 
additionally able to determine accurately the noise level in the frequency window of 
interest. 
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1. Introduction 

The data obtained from LISA [1] will contain a large number of white dwarf binary 
systems across the whole observational window J2J. At frequencies below ~ 3 mHz the 
sources are so abundant that they produce a stochastic foreground whose intensity 
dominates the instrumental noise [3J. The closer (and louder) sources will still be 
sufficiently bright to be individually resolvable. Above ~ 3 mHz the sources become 
sufficiently sparse in parameter space (and in particular in the frequency domain) that 
the detectable sources become individually resolvable. The identification of white dwarfs 
in the LISA data set represents one of the most interesting analysis problems posed by 
the mission: the total number of signals in the data set is unknown, the effective noise 
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level affecting the measurements is not easily estimated from the data streams, and 
there is a large number of overlapping sources to the limit of confusion. 

Bayesian inference provides a clear framework to tackle such a problem [H [5j E]. 
Some of us have carried out exploratory studies and "proof of concept" analyses 
on simplified problems that have demonstrated that Bayesian techniques do indeed 
show good potential for LISA applications [HI [101 [12]. Similarly other authors have 
successfully implemented techniques using Bayesian inference [18] [T71 [16] . In this paper 
we present the first results of an end-to-end analysis pipeline developed in the context of 
the Mock LISA Data Challenges that has evolved from our earlier work. This pipeline 
is applied to the simplest single-source challenge data sets 1.1.1a and 1.1.1b and all the 
results presented here are obtained after the release of the key files. In a companion 
paper [19], we present results that we have obtained for the analysis of the data sets 
containing gravitational radiation from a massive-black-hole binary inspiral. Our group 
submitted an entry for the MLDC analysing the blind data set 1.1.1c [131 [H] : however 
that result suffered from the fact that the pipeline was not complete, the analysis code 
was inefficient and we encountered hardware problems with the Beowulf cluster used to 
perform the analysis. 

The results that we present here are obtained with a two-stage end-to-end analysis 
pipeline: (i) we first process the data set with a grid-based coherent algorithm to identify 
candidate signals; (ii) we then follow up the candidate signals with a Markov Chain 
Monte Carlo code to obtain probability density function on the model parameters. Our 
method differs from other MCMC methods that have been proposed and applied to 
the MLDC data in the context of white dwarf binaries [HI QH [16] : the MCMC is not 
used to search, but only in the final stage of the analysis to produce posterior density 
functions of the model parameters. The noise spectral level is included as one of the 
unknown parameters and is estimated together with the parameters of the gravitational 
wave source(s). 

2. Analysis method 

In this section we describe the two stage approach that we have adopted for the analysis. 
The signal produced by a white dwarf binary system is modelled as monochromatic in 
the source reference frame, following the conventions adopted in the first MLDC [3 El [9]. 
It is described by 7 parameters: ecliptic latitude $ e and longitude <p e , inclination l and 
polarisation angle frequency at a reference time fo and corresponding overall phase 
$o an d amplitude A. 

The data distributed for the MLDC are the three TDI vl.5 Michelson observables 
X, Y and Z% From those we construct the two orthogonal TDI outputs 



A = (2X-Y-Z)/3 
E=(Z-Y)/V3 



(1) 
(2) 



X In our MCMC analysis we use the data set produced using the LISA Simulator. 
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by diagonalizing the noise covariance matrix following the procedure presented in |23j. 
The noise affecting the channels A and E is uncorrelated and described by the one-sided 
noise spectral density S n (f). We model the LISA response function in the low frequency 
limit in order to improve the computational efficiency of our analysis. 



2.1. First stage: Grid based search 

The first stage of the pipeline consists of a fast search of the data for the best 
matched filter based on the well-known jF-statistic algorithm, first developed for triaxial 
pulsar signals in the context of ground-based observations [20] • This exploits the Fast 
Fourier Transform to perform matching in the frequency domain to templates which are 
generated at an array of fixed points in the parameter space. 

The data from an individual detector in the frequency domain d(f) is supposed to 
contain a signal plus Gaussian noise, d(f) = h(f) + n(f). We define the logarithmic 
likelihood as a measure of match, as given by logL « (d\h) — \{h\h) with (-|-) denoting 
the scalar product as defined in [20J. A single signal in the ^-"-statistic algorithm is 
re-parameterised as a linear function of four orthogonal variables, and the frequency 
/o- The detection statistic is based on four parameters A^, B^, and Dp, found by 
integrating over the response functions a(t) and b(t) to the two polarisation states of 
the gravitational wave signal [20] . 



'"Tobs 

Aj: — — ^— I a{t) 2 dt (3) 



T, 



obs JO 



2 



Bf = t— b{tfdt (4) 

Jobs Jo 

O = — - / a(t)b(t)dt, (5) 



T bs Jo 

D T = A T B T - C% (6) 

T Q bs denotes the total observed time for the data set being analysed. The optimal 
detection statistic 2JF, which is pre- maximised over the nuisance parameters ho, <Po 
and ip is 

2J7 = 8 B T \F a \ 2 + A?\F b \ 2 - 2Q x TZ(F a F b ) 

S n (f)T bs Djr 

F a and Fj, are the demodulated Fourier transforms of the data, 

F a = I d{t)a{t)e' im dt; F b = / d{t)b{t)e~ im dt, (8) 
Jo Jo 

$(£) is the phase of the gravitational wave signal, as is described in [22J. 

As the LISA array moves in space, the frequency fo is affected by Doppler 

modulations. This modulation changes with differing position of the source in the sky, 

implying the need to recalculate the modulations and thus a(t) and b(t) for each sky 

position that is tested - a significant factor in the performance of this approach. The 

differing modulation structure however also allows us to estimate the location of the 
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source in the sky by maximising the IT value. The resolution possible on the sky with 
this method is not as good as from a full Bayesian posterior probability calculation as 
performed in the parameter estimation stage, as shown in an example for Challenge 
1.1.1a in figure [H Nevertheless, since this statistic can be computed fairly quickly it 
serves as a useful way of finding initial values to feed into the MCMC routine, as adopted 
within the pipeline. The resolution achievable on the sky increases with frequency, which 
implies that the mismatch between filter and signal falls off more rapidly at higher 
frequencies, requiring a greater number of templates to cover the sky. Therefore for 
challenge 1.1.1b at / ~ 3mHz a sky grid of size 5,752 points was used, in comparison 
with 765 points for challenge 1.1.1a at / ~ 1 mHz. 

The ^-"-statistic search was implemented using the LIGO "Lalapps" suite of software 
[24], in which the pulsar search code was modified by Reinhard Prix and John Whelan 
to use the LISA response function for the TDI variables X, Y, and Z [2T]. These input 
data streams were given in the form of Short Fourier Transforms, each of length one day, 
created from the MLDC1 challenge data. For each challenge the full specified range of 
frequencies was searched for the signal as it would be in a blind search. The code was 
run on a single CPU and executed in a few hours, with the run-time increasing at higher 
frequency due to the higher resolution of sky and frequency grid that had to be used. 
The candidate chosen to pass to the MCMC stage was simply that which triggered the 
highest value of 2T. 



2.2. Second stage: Markov Chain Monte Carlo follow-up 

According to Bayes' theorem, the posterior probability, p(fh\d) of a model m given the 
data d depends on the prior distribution p(m), containing the information known before 
the analysis, the likelihood L(d\m) of the model and a normalisation factor p(d) 

, . ~ L(d\m)p(rh) 
p(m\d) = 1 1 -f> ; (9) 
p(d) 

The posterior probability density function shows the joint probability density of given 
values of parameters of the model rh, conditional on the data d. 

We implemented Bayes' theorem using data in the form of TDI variables A and E 
and modelled our template according to the Long Wavelength Approximation directly 
in the Fourier domain [25] to gain computational speed. The logarithmic likelihood 
L(d\m) in this stage explicitly included its dependence on the one-sided noise spectral 
density S n (f) 

log L(d\m) = const. — -logS n (f) — (d—h\d — h), (10) 

shown here for either A or E, with the combined likelihood as sum of the individual 
likelihoods. We restricted our analysis to a sufficiently narrow frequency window in 
order to be able to approximate the noise spectral density as constant, S n (f) = S . This 
window was set as the interval in frequency that contains at least 98% of the power of our 
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2F as a function of sky position, at a frequency 0.001 063 Hz 




1 2 3 4 5 6 

Ecliptic Longitude 



Figure 1. The variation of 2 J- values for the search for unknown signal 1.1.1a, as a 
function of sky position, parameterised by ecliptic latitude [3 and longitude A. The 
distribution is multi-modal and non-Gaussian, and has a poor resolution in comparison 
with that can be achieved with the MCMC and a Bayesian likelihood, but by finding the 
maximum it serves well as a starting point for the more refined parameter estimation 
below. 

model rh, with the interval's upper and lower limits given by /± (2/year)(5 + 2TrfoAU /c) 
|25j . So is therefore an additional parameter to be inferred within the model rh in Eq. [TUJ 
We implemented an automatic Random Walk Metropolis sampler (Stroeer & 
Vecchio 2007, in. prep.) to sample from the posterior probability density function in 
form of a Markov chain. Metropolis sampling eliminates the need to explicitly calculate 
the normalisation constant in Bayes' theorem, and the evolving Markov chain gives 
easy access to joint as well as marginalised posterior density distribution. The sampler 
was started from the parameter set which triggered the highest value of IT in our grid 
based coherent run of the analysis (see former section). The automated function of the 
Metropolis sampling is achieved by controlling the sampling step-size with adaptive 
acceptance probability techniques [26]. The sampler therefore does not depend on 
assumptions about the signal in the data set in order to perform successfully and reliably; 
it develops a suitable algorithm and approach by itself based on the properties of the 
likelihood as found on the fly, in the initial steps of the sampler. The length of our 
Markov chain was pre-set to 10 6 , with the initial 10 4 chain states discarded as the 
"burn-in" phase of our sampler. The runtime for one data analysis run is 5 hours on a 
single 2 GHz CPU on the Tsunami cluster of the University of Birmingham. 
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Figure 2. The marginalised posterior probability density functions of the eight 
unknown parameters - the seven parameters that describe the signal and the noise 
spectral density So - for the the challenge data set 1.1.1a. The vertical black solid 
line denotes the true value of the parameter (for the polarisation angle the true value 
modulo 7r/2), and the grey dashed line the initial value for the MCMC analysis as 
determined by the template of the first-stage that produces the maximum value of the 
^"-statistic. In the case of the noise spectral density the first stage of the analysis does 
not provide an estimate; the true value of this parameter is taken to be the value of 
the instrumental noise spectrum used to generate the data set and provided in [5]. 
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Figure 3. The marginalised posterior probability density functions of the eight 
unknown parameters for the the challenge data set 1.1.1b. Labels are as in Figure 
2. 



3. Results 

We found that the most promising candidate signal from the jF-statistic search already 
matched the true embedded signal to high accuracy, particularly in frequency and 
sky location. Our MCMC sampler, as a post-processing unit, thus only needed 1000 
iterations to burn in and to establish a reliable sampling from the posterior. The 
marginalised posteriors are shown in Figs. [2] and [3j We found, as seen in latter 
figures, that the MCMC sampler further refined the initial guesses from the JF-statistic, 
as measured by the absolute difference between the true value of a given parameter 
and the median of the marginalised posterior recovered for that parameter. Table Q] 
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Table 1. Details about the results from Challenge 1.1.1a and Challenge 1.1.1b. So, 
the constant one-sided noise spectral density within our narrow frequency window, is 
compared to the true one sided noise spectral density at the true frequency of the 
signal, 'J/ is given modulo ir/2. Intg denotes the minimum interval to include 90% of 
MCMC states for given parameter, Amode denotes the absolute difference between the 
true value of a signal parameter and the mode of its recovered posterior; Amedian and 
Amean denote the equivalent absolute difference for median and mean of the posterior 
respectively; a denotes the sampled standard deviation of the posterior as derived 
from the median. We further quote the signal-to-noise ratio (SNR) for a template 
using the true values of the source and the recovered values of the data analysis run, 
as derived from the median of the individual posterior distributions, and the correlation 
C between these two templates. 
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shows details of the statistics of recovered posterior distributions. We highlight that 
the majority of the true values of the parameters are within one standard deviation 
of the median of the posterior, with a small percentage within two sampled standard 
deviations. In addition, every true value of a parameter of the signal is within the 
minimum interval of the posterior to cover 90% of all MCMC state values. Recovered 
signal-to-noise ratios are measured as SNR = (s\h) / a/ (h\h), and the match C = 
(/itrue I ^med) / s/ (^true | ^true) (^med | ^med) between a template constructed from the true 
values and a template from the median values of the individual posterior distributions, 
yielding a correlation that is always higher than 0.97. Noise levels are determined 
accurately and within 1 to 1.5 sampled standard deviations. Nevertheless we note that 
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our run on Challenge 1.1.1a shows a lower match and higher differences between true 
value and recovered value of parameters as compared to the run on Challenge 1.1.1b. It 
also exhibits tailing posterior distributions in inclination and amplitude, although the 
SNR of Challenge 1.1.1a is twice the value of Challenge 1.1.1b. 

4. Conclusions 

We have presented a new approach to LISA data analysis in the form of an end-to-end 
pipeline. We first detected and identified candidate signals in the LISA data stream with 
a grid-based coherent algorithm, and then post-processed the most promising candidate 
signals with an automatic Markov Chain Monte Carlo code to obtain probability 
densities for the model's parameters. We demonstrated successful identification and 
post-processing of the signals from the double white dwarf single source MLDC data 
sets 1.1.1a and 1.1.1b. Furthermore, the automatic Markov Chain Monte Carlo code 
successfully identified the noise level within a small frequency window of interest in 
these data sets. We note that a parallel approach to the data analysis of binary inspiral 
signals is being developed by Rover et al, with a Markov Chain Monte Carlo method 
that can successfully post-process a candidate signal generated from the true parameters 
of the signal. Signal detection in a pre-processing stage is currently being tested within 
parallel tempered MCMC methods and/or time-frequency analyses [19J. 

We identify two prominent and promising features of our pipeline: its ability to 
determine good initial conditions for the MCMC and its ability to run the MCMC 
automatically. As we have demonstrated in this paper, the width of the marginalised 
posterior density for the frequency parameter is extremely narrow. It is therefore vital 
that the initial estimate of the frequency is within this region, as the almost flat structure 
of the posterior PDF outside this region gives little to no information on the location of 
the peak. The chances of finding the mode through a random sampling are decreased 
further still with a larger prior range for the parameter. Adding an ^-"-statistic search as 
the first stage in the pipeline solves this problem, since the frequency and position in the 
sky are recovered very accurately, to within the limits of the posterior probability region 
of interest, before the MCMC performs post-processing and parameter estimation. The 
automatic feature of the MCMC ensures a successful post-processing for the other 
astrophysical parameters that may have been located outside the posterior region of 
interest by the jF-statistic approach, as in the case for the amplitude of Challenge 
1.1.1a. Convergence is aided by the ability of our code to increase or decrease sampling 
step-sizes according to its experience of the sampling quality of the posterior during the 
burn-in phase. 

We are working on an extension of the pipeline as shown in this document to 
successfully tackle multi-source data sets, required for the second round of the MLDC. 
Current work includes the exploration of our grid-based coherent search on such data 
streams in order to automatically identify the most promising individual candidate 
signals, and the implementation of an automatic Reversible Jump Markov Chain Monte 
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Carlo routine (e.g. as already demonstrated in [TD]) to find the trans-dimensional 
probability density functions of the parameters of an unknown total number of signals. 
We highlight that the noise level determination presented here already serves as a key 
ingredient to round 2, where the simulation of a galactic white dwarf binary population 
introduces additional confusion noise levels from unresolvable sources. 
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